Nature Biotechnology

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match Nature Biotechnology's content profile, based on 147 papers previously published here. The average preprint has a 0.35% match score for this journal, so anything above that is already an above-average fit.

1
DAMPA - accelerated and simplified design of probe panels for targeted metagenomics using pangenome graphs

Payne, M.; Tam, K. K.-G.; Rockett, R. J.; Basile, K.; Bowden, R.; Sintchenko, V.; Kok, J.; Golubchik, T.

2026-05-22 infectious diseases 10.64898/2026.05.15.26352859 medRxiv
Top 0.1%
42.5%

Targeted metagenomics, where samples are enriched for multiple organisms of interest using oligonucleotide probes, is a highly efficient sequencing methodology that is becoming standard practice for genomics of viruses and complex polymicrobial samples. Efficient enrichment critically requires probes that capture both conserved and highly diverse genomic regions without loss of sensitivity, and with uniform representation in the sequencing pool. Design of optimal probesets poses a challenge: existing computational methods use k-mer hashing to reduce over-abundant sequences, but scalability and efficiency drop with increasing numbers of genomes, while diverse sequences remain under-represented. Here we show that incorporating evolutionary distance to compress probes via a graph-based representation of multiple genomes across species, together with k-mer hashing, reduces overrepresentation of conserved sequences, and yields more uniform coverage even of highly diverse loci. We make the method available in Dampa, an open-source tool that generates probesets in seconds on a standard laptop.
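The k-mer hashing approach that the abstract attributes to existing probe-design methods can be sketched roughly as follows. This is a minimal illustration of the general idea only, not Dampa's graph-based algorithm; the function names, the k-mer size, and the `max_shared` threshold are hypothetical choices made for the example.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def select_probes(candidates, k=8, max_shared=0.5):
    """Greedily keep probes whose k-mers are mostly unseen.

    A candidate is dropped when more than max_shared of its k-mers
    already occur in previously kept probes (hypothetical threshold).
    """
    seen = set()
    kept = []
    for probe in candidates:
        ks = kmers(probe, k)
        if not ks:
            continue  # probe is shorter than k, nothing to compare
        shared = len(ks & seen) / len(ks)
        if shared <= max_shared:
            kept.append(probe)
            seen |= ks
    return kept
```

A greedy pass like this removes near-duplicate probes from conserved regions, but its cost grows with the number of input genomes, which is the scalability limit the abstract describes.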

2
Dibenzocyclooctyne-modified PCR primers enable direct enzyme-free click chemistry ligation for custom nanopore amplicon sequencing

Lypaczewski, P.; Shapiro, B. J.

2026-04-21 genomics 10.64898/2026.04.18.719403 medRxiv
Top 0.1%
40.7%

Oxford Nanopore Technologies (ONT) rapid library preparation kits use transposase-mediated tagmentation to attach click chemistry functionalized oligonucleotide duplexes to fragmented DNA, followed by click chemistry to conjugate Rapid Adapter (RA) sequencing adapters. A similar protocol is used in 16S rRNA gene amplicon and PCR-amplified rapid whole-genome sequencing workflows. Here, we describe custom oligonucleotides with dibenzocyclooctyne (DBCO) added onto PCR primer 5' termini. After standard PCR amplification, DBCO-modified amplicons react spontaneously with RA sequencing adapters, producing sequencing-ready libraries in minutes without enzymatic processing. All configurations employ an asymmetric design in which the DBCO modification is restricted to a single primer, leaving the opposite primer available for barcoding at low cost. We validate three primer architectures: (i) direct attachment of DBCO to a target-specific primer, (ii) a universal DBCO-modified oligonucleotide used in a two-step PCR workflow, and (iii) a three-primer single-pot reaction combining the universal DBCO oligonucleotide with unmodified target-specific primers. These configurations are validated using full-length 16S rRNA gene amplicons sequenced on a PromethION flow cell. DBCO-modified primers are synthesized either commercially or in-house via DBCO-TFP ester conjugation to 5'-amino oligonucleotides and remain fully active through standard PCR thermocycling. The best-performing configuration used a two-step PCR with a universal oligonucleotide and achieved higher pore occupation and reads than comparable commercial solutions. This approach reduces library preparation reagent costs compared to available kits, as the initial synthesis cost is lower than existing amplicon sequencing kits, while providing enough material for hundreds or thousands of PCR reactions. This is further applicable to an unlimited number of gene targets beyond 16S sequencing.

3
ESPeR-seq: Extremely Sensitive and Pure, End-to-end, RNA-seq library preparation

Chen, H.-M.; Kao, J.-C.; Yang, C.-P.; Tan, C.; Lee, T.; Sugino, K.

2026-03-15 genomics 10.64898/2026.03.12.711386 medRxiv
Top 0.1%
40.2%

The Smart-seq family of methods represents the gold standard for high-sensitivity, full-length single-cell RNA sequencing. Despite iterative improvements, fundamental challenges remain: the generation of non-specific PCR products that limit sensitivity, the inability to capture precise Transcription End Sites (TES), and the insidious generation of "phantom UMIs"--artificial molecular barcodes created during PCR that systematically inflate molecular counts. Here, we present ESPeR-seq, a novel architecture that resolves these barriers. To enable precise, stranded TES capture, we developed an "Omega-dT" primer that bypasses synthetic poly-T tracts, restoring high-quality sequencing directly at transcript termini. To eliminate both PCR background and phantom UMIs, we implemented a biochemical "multi-lock" mechanism utilizing uracil-containing TSOs and a uracil-intolerant DNA polymerase. We validate this approach using the logQ-slope, a novel metric that sensitively diagnoses UMI fidelity. Benchmarking reveals that while state-of-the-art methods still exhibit signs of UMI inflation, ESPeR-seq strictly prevents it. Furthermore, the strandedness and precise end-delineation provided by TSO and dT reads support robust de novo gene model reconstruction, enabling the discovery of novel multi-exon genes, unannotated 3' UTR extensions, and candidate eRNAs across aggregated single-cell populations. Thus, ESPeR-seq establishes a robust framework for absolute quantitative accuracy and full-length isoform resolution.

4
CMS: Achieving Uniform and High-Quality Sequencing across Challenging Non-canonical Genomic Regions

Li, Q.; Liu, L.; Lin, Q.; Dan, X.; Jiang, Y.; Wei, Y.; Yang, M.; Peng, X.; Luo, W.; Wang, W.; Xu, D.; Huang, Z.; Sun, W.; Zhao, L.; Yan, Q.; Sun, L.; Feng, B.

2026-04-28 genomics 10.64898/2026.04.24.720553 medRxiv
Top 0.1%
39.9%

High-throughput sequencing is essential in modern biological research, yet low-complexity sequences remain challenging as they form structurally complex, non-canonical (non-B) DNA conformations that impede sequencing enzyme read-through. This leads to a long-standing trade-off: maximizing coverage introduces false positives (FP), while stringent filtering causes coverage loss and false negatives (FN). To address this, we developed CMS (Cross Mountains and Seas) on GeneMind sequencing platforms by optimizing its chemistry and enzymatic systems to traverse these secondary structures with high fidelity. Benchmarking across whole-genome (WGS) and whole-exome (WES) sequencing demonstrates that CMS addresses the trade-off by simultaneously enhancing both coverage uniformity and accuracy, notably achieving an approximately 100-fold reduction in low-coverage bins for WGS and a 70% reduction in FN insertions/deletions (INDELs) within complex non-B regions. Specifically, a synthetic G-quadruplex (G4) motif sequencing experiment demonstrates that CMS maintains a 1:1 strand ratio, effectively handling G4-induced biases where benchmarked platforms exhibit extensive depletion. These findings establish CMS as a reliable technology for the precise characterization of structurally challenging but functionally essential genome regions.

5
Structure-Led Exploration of the Metagenome Yields Novel RNA-Guided Nucleases with Broad PAM Diversity

de los Santos, E. L.; Rieber, L.; Wang, M.; Catherman, S.; Hatfield, S.; Bowen, T.

2026-03-29 genomics 10.64898/2026.03.27.714800 medRxiv
Top 0.1%
34.2%

CRISPR-Cas bacterial adaptive immune systems use reprogrammable RNA guide sequences to specifically bind and cleave nucleic acids, which have been repurposed for easy and relatively efficient genomic editing. Despite its widespread use in biomedical research, the large size of Cas9 hinders AAV-mediated therapeutic delivery. Smaller RNA-guided nucleases could improve AAV gene therapy delivery, but their application is limited by their rarity among bacterial genomes and the restrictive sequence preferences of known systems, especially compared to the diversity of PAMs seen in the highly abundant Cas9 systems. Existing methods for identification of novel CRISPR subtypes rely on sequencing ever more bacterial genomes and comparing sequence homology. Using recent advances in protein structure prediction and comparison, we have identified and characterized proteins from known and novel compact RNA-guided nucleases and demonstrated that their PAM preference diversity meets or exceeds that of Cas9 systems or the compact IscB and TnpB systems. This discovery has enabled us to demonstrate editing in eukaryotic cells with multiple novel subtypes, which--together with their compact size, varied PAM sequences, and high specificity--make them attractive tools for in vivo genome editing.

6
Sequencing depth overcomes extraction bias: repurposing human WGS data for salivary microbiome profiling

Velo-Suarez, L.; Herzig, A. F.; Bocher, O.; Le Folgoc, G.; Le Roux, L.; Delmas, C.; Zins, M.; Deleuze, J.-F.; Hery-Arnaud, G.; Genin, E.

2026-04-01 genomics 10.64898/2026.03.27.714786 medRxiv
Top 0.1%
32.3%

Large-scale human genomic projects have generated whole-genome sequencing (WGS) data from hundreds of thousands of individuals, primarily to study host genetic variation. When saliva is the DNA source, the resulting datasets also contain microbial reads that are routinely discarded. Here, we investigate whether these host-centric WGS workflows can yield reliable microbiome profiles, effectively doubling the research value of existing data without additional sampling. We compared non-human reads from 39 deeply sequenced saliva samples from the GAZEL cohort (miG dataset; median ~43 million reads/sample) with 14 samples processed with microbiome-optimized extraction (ASAL; median ~4.3 million reads/sample), using two complementary classifiers: meteor, a coverage-based mapper against a curated saliva-specific database, and sylph, a k-mer classifier against the Genome Taxonomy Database (GTDB). Despite the absence of microbial lysis optimization, miG samples showed up to 3-fold higher species richness, ~10-fold greater sequencing depth, and significantly lower inter-sample variability (PERMANOVA R² = 0.10, p = 0.001; BETADISPER p = 0.0036). Rarefaction to 10 reads eliminated most compositional differences, demonstrating that sequencing depth is the primary driver of community stability. Only ~2% of detected taxa (12 of 592) showed extraction-related differences. The two classifiers exhibited fundamentally different depth-sensitivity profiles, with sylph retaining systematic detection asymmetries even after depth normalization, highlighting that classifier choice introduces biases that affect cross-study comparisons. These results show that biobank WGS data from saliva can be repurposed for robust, population-scale oral microbiome analyses, enabling simultaneous investigation of host genomic variation and the microbiome from the same archived samples.
Importance: Saliva-based whole-genome sequencing datasets generated across various cohorts to study human genetics contain non-human reads that are routinely discarded, thereby overlooking valuable microbial information. We show that these reads are sufficient to reconstruct robust oral microbiome profiles -- without any additional sampling or laboratory work. This finding unlocks a vast archive of existing genomic data for retrospective microbiome research, enabling population-scale studies of oral microbial diversity, host-microbiome interactions, and disease associations at minimal additional cost. We further demonstrate that the choice of taxonomic classifier introduces systematic, depth-dependent biases that persist even after normalization, a practical consideration for any cross-cohort or multi-platform microbiome study.
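Rarefaction, the depth normalization mentioned in the abstract above, is commonly done by subsampling each sample's reads to a shared depth without replacement. The sketch below shows that generic procedure under stated assumptions; it is not taken from the preprint, and the function name, the dictionary input format, and the fixed seed are hypothetical.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Subsample read counts to `depth` reads without replacement.

    counts: dict mapping taxon name -> read count (assumed input format).
    Returns a dict of subsampled counts, or None when the sample holds
    fewer than `depth` reads (such samples are typically discarded).
    """
    # Expand the count table into one label per read.
    pool = [taxon for taxon, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)  # fixed seed for reproducible subsampling
    return dict(Counter(rng.sample(pool, depth)))
```

Expanding every read into a list is simple but memory-hungry; for samples with tens of millions of reads a hypergeometric draw per taxon would be the more practical choice.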

7
upSPLAT: Early-Barcoded Library Preparation for Cost-Effective Population-Scale Genomics

Raine, A.; Daniels, R. J.; Kjellin, J.; Wiman, A.-C.; Liljedahl, U.; Ramsell, J.; Wheat, C. W.; Gotthard, K.; Pettersson, M. E.; Andersson, L.; Nordlund, J.

2026-05-13 genomics 10.64898/2026.05.09.723775 medRxiv
Top 0.1%
28.4%

Advances in high-throughput sequencing have substantially reduced sequencing costs, yet library preparation remains a major financial and logistical bottleneck, particularly for high-throughput applications or low-quality DNA inputs. Here, we introduce upscaled Splinted Ligation Adapter Tagging (upSPLAT), a library preparation strategy that combines early sample barcoding with single-stranded splinted ligation to enable highly multiplexed pooled sequencing at substantially reduced cost. upSPLAT supports flexible high-plex pooling and reduces per-sample library preparation costs by approximately 10-fold compared to conventional workflows. By leveraging single-strand ligation, upSPLAT is compatible with a wide range of DNA inputs, including degraded, damaged or denatured double-stranded DNA, bisulfite or enzymatically converted DNA, and viral single-stranded DNA. We present two complementary workflows and evaluate their performance across multiple species and DNA qualities, demonstrating robust demultiplexing, uniform sample representation, and low barcode cross-assignment. Together, upSPLAT provides a scalable, cost-effective solution for sequencing-based studies requiring large sample numbers while preserving individual-level information.

8
The Common Fund Data Ecosystem (CFDE)

Jurgens, J. A.; Bueckle, A.; Vora, J.; Maurya, M. R.; Mohseni Ahooyi, T.; Zheng, E.; Stear, B.; Wang, D.; Ree, C.; Ramachandran, S.; Nekrutenko, A.; Brandes, M.; Thaker, S.; Katz, D. H.; Munoz-Torres, M. C.; Diamant, I.; Chun, H.-J. E.; Simmons, J. A.; Tasian, S. K.; Jenkins, S. L.; Evangelista, J. E.; Dodia, H.; Saha, S.; Lindquist, M. A.; Gajjala, V.; Nemarich, C.; Zhen, J.; Ross, K. E.; Byrd, A. I.; Shilin, A.; Metzger, V. T.; Bologa, C. G.; Srinivasan, S.; Jang, D.; Kumar, P.; Taub, L. D.; Levanto, M. P.; Petrosyan, V.; Anandakrishnan, M.; Kim, M.; Clarke, D. J. B.; Ivich, A.; Crichton, D.

2026-04-12 scientific communication and education 10.64898/2026.04.10.717672 medRxiv
Top 0.1%
28.4%

The NIH Common Fund Data Ecosystem (CFDE) integrates data resources from 18 NIH Common Fund programs for discovery and integrative analysis. These programs generate valuable but heterogeneous datasets that can be difficult to discover, access, and reuse. CFDE aims to provide a collaborative, community-built infrastructure that links and enriches Common Fund programs. We describe the evolution, structure, and core technologies of CFDE, including practical approaches that support submission, integration, visualization, and public release of multimodal data. Training programs and workforce initiatives lower barriers to adoption. CFDE has devised solutions to critical issues facing cross-program initiatives, including data scale and heterogeneity, dataset integration, and long-term sustainability. We demonstrate the utility of linking Common Fund resources through integrative tools and cross-dataset queries to yield insights that would otherwise be infeasible. Collectively, CFDE shows that a standards-driven, federated approach enhances and unifies cross-disciplinary resources, fostering collaboration and data-driven discovery.

9
BenchDrop-seq: a microfluidics-free platform for benchtop single-cell long-read RNA sequencing

Bregman, J.; Nichols, C.; Ramisetti, R.; Srivastava, A.

2026-03-12 genomics 10.64898/2026.03.12.706999 medRxiv
Top 0.1%
28.3%

Single-cell long-read RNA sequencing enables direct measurement of full-length transcripts but has remained difficult to deploy at scale due to reliance on microfluidic barcoding, specialized instrumentation, and high per-cell cost. Here we present BenchDrop-seq, a benchtop platform for single-cell long-read transcriptomics that leverages particle-templated partitioning for single-cell molecular barcoding and couples this workflow to Oxford Nanopore sequencing for full-length transcript capture. By integrating established bead-based partitioning chemistry with long-read sequencing and a dedicated open-source analysis pipeline for barcode recovery, alignment, and transcript quantification, BenchDrop-seq enables isoform-resolved measurements from thousands of individual cells using standard laboratory equipment. We validate the platform in both a homogeneous cell line and a heterogeneous primary tissue, demonstrating high barcode recovery, accurate gene-level quantification, and reproducible detection of cell-type-specific transcript usage that is not readily accessible to short-read assays. Together, BenchDrop-seq establishes a practical and accessible framework for single-cell long-read RNA sequencing, lowering experimental barriers while enabling transcript-level analyses in routine single-cell experiments.

10
Adaptive sampling-based enrichment enables genome reconstruction of intracellular symbionts despite host background and reference divergence

Huang, W.-K.; Yang, C.-H.; Chung, H.; Lee, Y.-C.; Wu, Y.-C.; Chen, Y.-T.; Wan, M.-H.; Yeh, W.-S.; Hong, Y.-P.; Wu, T.-H.; Li, J.-C.; Liu, W.-L.; Chen, C.-H.; Chen, Y.-T.

2026-03-27 genomics 10.64898/2026.03.25.714109 medRxiv
Top 0.1%
28.1%

Recovering genomes of intracellular microbes from host-dominated samples remains a major challenge in microbial genomics, due to low target abundance, overwhelming host DNA, and the inability to culture these organisms independently. Despite extensive interest in Wolbachia, efficient genome recovery directly from host tissues remains limited by the inefficiency of host-dominated sequencing and the constraints of existing enrichment strategies. Here, we demonstrate that Oxford Nanopore adaptive sampling (AS) enables efficient, real-time enrichment of target DNA directly from complex host tissues, providing a culture-free approach for genome recovery in such systems. To our knowledge, this represents the first application of enrichment-mode adaptive sampling to achieve de novo reconstruction of an intracellular endosymbiont genome in a mosquito system. Using Aedes aegypti mosquitoes infected with a locally derived wAlbB-like strain, we applied enrichment-mode AS to selectively sequence Wolbachia DNA. This resulted in an increase from <1% Wolbachia reads in conventional shotgun data to ~90% under adaptive sampling. De novo assembly of AS-enriched long reads yielded a near-complete genome (~1.5 Mb) in two contigs with >96-99% completeness. Comparative analyses revealed multiple large-scale chromosomal rearrangements relative to the reference wAlbB genome, demonstrating that adaptive sampling does not impose reference-dependent genome structure. Annotation further identified three prophage-associated regions, including two strain-specific expansions absent from the reference genome. Notably, cytoplasmic incompatibility genes (cifA and cifB) were identified adjacent to one of these regions, consistent with their known genomic association with prophage elements.
Importantly, adaptive sampling remained effective despite substantial structural divergence between the reference and target genomes, revealing an unexpectedly robust application of this approach beyond its presumed operating conditions. Together, these results establish enrichment-mode adaptive sampling as a robust and scalable strategy for genome-resolved analysis of intracellular bacteria in host-associated systems.

11
Scalable Microbiome Network Inference: Mitigating Sparsity and Computational Bottlenecks in Random Effects Models

Roy, D.; Ghosh, T. S.

2026-03-31 bioinformatics 10.64898/2026.03.27.714858 medRxiv
Top 0.1%
27.8%

The application of Large Language Models (LLMs) and Transformers to biological and healthcare datasets requires the extraction of highly accurate, noise-filtered ecological networks. The Random Effects Model (REM) is a powerful statistical method for inferring microbial interaction networks and identifying keystone species across heterogeneous studies. However, existing implementations in R that rely on single-threaded "Iteratively Reweighted Least Squares" (IRLS) are computationally prohibitive for high-dimensional metagenomic data, creating a significant bottleneck for downstream machine learning pipelines. In this paper, we present Parallel-REM, a highly scalable, Python-based parallel pipeline accelerating large-scale network inference. By integrating robust variance filtering, sparsity checks, and a batched Master-Worker parallelisation strategy using joblib and statsmodels, we resolve native convergence failures associated with sparse biological matrices. Benchmarking on a massive clinical dataset comprising 70,185 samples and 466 optimal species demonstrates a 26.1x speedup over sequential baselines on a 64-core architecture, reducing computation time from days to minutes. Furthermore, statistical validation shows > 99.9% directional concordance with the original R implementation. Parallel-REM democratises large-scale network extraction, providing the high-throughput infrastructure necessary to feed clean, topological and biological features into modern deep learning and Transformer-based diagnostic architectures.

12
Functional characterisation of an essential neo-chromosome III in Sc2.0 strain reveals opportunities and challenges for genome minimisation in Sc3.0

Swidah, R.; Monti, M.

2026-04-22 genomics 10.64898/2026.04.20.719597 medRxiv
Top 0.1%
27.4%

Large-scale genome minimisation in eukaryotes remains a major challenge due to essential genes embedded within deletion-refractory regions and pervasive synthetic lethal interactions. Here, we address these limitations by engineering an essential neo-chromosome III that relocates all 14 essential genes from synthetic chromosome III onto a separate chromosome, thereby enabling further minimisation of synIII. To further expand design space, we created highly synthetic neo-chromosome variants with sequences absent from natural genomes. We refactored essential gene expression using both native and orthogonal promoter-terminator pairs from Saccharomyces paradoxus and S. eubayanus. Reporter assays showed that orthogonal regulatory elements largely recapitulate S. cerevisiae activity. Both architectures restored viability in essential gene deletion libraries, demonstrating robust cross-species complementation. Engineered linear and circular forms of essential neo-chromosomes were highly stable over 100 generations and supported a near wild-type phenotype. Relocating essential functions enabled SCRaMbLE-mediated deletion of previously inaccessible regions, substantially expanding the deletion landscape. To improve the screening efficiency of SCRaMbLEd strains, we developed a SCRaMbLE reporter, ERICA (Elementary Random Integration Cassette), a loxPsym-flanked URA3 cassette that integrates randomly and enables iterative selection. Nanopore sequencing confirmed complex rearrangements, including deletions of up to ~40 kb and loss of essential loci. Together, this work establishes a modular and extensible platform for orthogonal essential gene engineering and SCRaMbLE-enabled genome reduction, providing key design principles for next-generation synthetic eukaryotic genomes. These findings have broad implications beyond yeast, providing transferable design principles for genome minimisation in more complex eukaryotic systems, including mammalian and human cells.

13
Droplet-compatible single-cell DNA methylation sequencing with PreTIC

Zhang, S.; Wang, F.; Zhang, Y.; Lee, K.-J.; Engman, C.; He, H.-Z.; Fan, Y.; Zheng, S.-Y.

2026-04-18 genomics 10.64898/2026.04.15.718726 medRxiv
Top 0.1%
27.3%

Single-cell DNA methylation sequencing has remained technically specialized due to challenges of interfacing conversion chemistry with droplet microfluidics. We introduce pre-tagmentation in situ conversion (PreTIC), which enables whole-methylome profiling on a commercial droplet platform using off-the-shelf reagents. With PreTIC, we produce over 13,000 single-cell methylomes from fixed frozen cells in two days and resolve cell-specific epigenetic variation in human peripheral blood mononuclear cells.

14
MERFISH 2.0, an ultra-sensitive single-cell spatial transcriptomics imaging chemistry across diverse tissue types

He, J.; He, L.; Wang, B.; Wiggin, T.; Chen, R.; Wang, H.; Yang, B.; Tattikota, S. G.; Maziashvili, L.; Zhang, T.; Revuru, S.; Wang, S.; Patil, S.; Sun, Y.; Sun, Y.; Li, M.; Cai, Y.; Wu, L.; Pentrenko, N.; Vasaturo, A.; Ray, M.; Emanuel, G.

2026-03-06 genomics 10.64898/2026.03.06.710199 medRxiv
Top 0.1%
26.9%

Spatial transcriptomics has emerged as a transformative approach for elucidating tissue architecture, cellular heterogeneity, and disease mechanisms by preserving the spatial context of gene expression in cells. Despite these advances, many spatial transcriptomic methods underperform in archival or clinically relevant specimens, particularly formalin-fixed, paraffin-embedded (FFPE) tissues, where RNA degradation and crosslinking hinder transcript detection. To address these challenges, we developed Multiplexed Error Robust Fluorescence In Situ Hybridization 2.0 (MERFISH 2.0), an optimized spatial transcriptomic imaging chemistry to enhance profiling of fragmented and highly crosslinked RNA. Across diverse human and mouse tissues preserved as fresh-frozen, fixed-frozen, and FFPE specimens, MERFISH 2.0 substantially increased transcript detection sensitivity by up to ~8-fold relative to MERFISH 1.0 while preserving quantitative concordance (Pearson r ≥ 0.8 across tissues). In archived fresh-frozen human brain samples, MERFISH 2.0's enhanced sensitivity improved transcript recovery, enhanced cell type resolution and spatial analyses. In a low-quality archival FFPE human breast cancer specimen, MERFISH 2.0 revealed additional cell populations, novel cell clusters, refined tumor-immune architecture, and increased detection of gene-gene and cell-cell interactions relative to MERFISH 1.0, underscoring the impact of improved sensitivity on downstream spatial analysis. By substantially expanding robust transcript detection to degraded and archival samples, MERFISH 2.0 enables scalable, cohort-level spatial transcriptomic analysis across clinically relevant tissue collections.

15
REMAG: recovery of eukaryotic genomes from metagenomic data using contrastive learning

Gomez-Perez, D.; Raguideau, S.; Warring, S.; James, R.; Hildebrand, F.; Quince, C.

2026-03-08 bioinformatics 10.64898/2026.03.05.709928 medRxiv
Top 0.1%
25.7%

Metagenome-assembled genomes (MAGs) are central to exploring microbial communities. Yet, despite the relevance of protists and fungi to diverse ecosystems, eukaryotic MAG recovery lags behind that of prokaryotes. A major bottleneck is that most state-of-the-art binning pipelines exclusively rely on prokaryotic single-copy core gene reference databases and are optimized for smaller genomes. To address this gap, we present REMAG (Recovery of Eukaryotic MAGs), a tool designed to recover high-quality eukaryotic genomes suited for long-read metagenomic data. REMAG leverages fine-tuned HyenaDNA genomic foundation models to efficiently filter eukaryotic contigs. It then employs a dual-encoder Siamese network trained with Barlow Twins contrastive loss to learn a shared embedding space by integrating contig composition and differential coverage. Finally, high-quality bins are extracted using greedy iterative Leiden clustering optimized with eukaryotic single-copy core gene constraints. In benchmarks based on simulated mixed prokaryotic/eukaryotic communities and real datasets of varying sizes and origin, we demonstrate REMAG's ability to recover more near-complete eukaryotic genomes than existing state-of-the-art tools, which often produce highly fragmented eukaryotic bins. REMAG provides an automated eukaryotic binning method that scales effectively with the increasing size and sequencing depth of metagenomic datasets.

16
DIANA: Deep Learning Identification and Assessment of Ancient DNA

Duitama Gonzalez, C.; Lopopolo, M.; Nishimura, L.; Faure, R.; Duchene, S.

2026-04-10 bioinformatics 10.64898/2026.04.09.717429 medRxiv
Top 0.1%
25.6%

The field of ancient metagenomics provides insights into past microbiomes, but with a growing dataset size, methods that rely on reference databases have limited scope. Here, we introduce DIANA, a multi-task neural network that predicts key metadata categories from unitig abundances. Trained on 2,597 run accessions (1.72 Tbp of assembled unitig sequences), DIANA accurately identifies sample host (94.6%), community type (90.0%), and material (88.9%) on held-out test data and demonstrates robust generalisation on an independent validation set. A key innovation is DIANA's ability to perform semantic generalisation, correctly classifying samples with labels unseen during training -- such as novel subspecies -- to their appropriate parent categories. By leveraging both known and uncharacterized genomic sequences, DIANA provides a rapid, data-driven system for metadata validation and quality control, accelerating discovery in ancient metagenomics research.

17
ZipStrain Enables Rapid and Precise Strain-Resolved Metagenomics

Ghadermazi, P.; Emerson, J. B.; Olm, M. R.

2026-05-22 bioinformatics 10.64898/2026.05.20.726564 medRxiv
Top 0.1%
25.4%

Strain-resolved metagenomics characterizes microbial communities at nucleotide-level resolution, enabling researchers to differentiate identical from closely related organisms and characterize population structure and gene content variation. Here we introduce ZipStrain, a program that performs highly accurate strain-resolved metagenomics over 500x faster than available methods while offering superior RAM management. Applied to a dataset of 2,754 samples spanning human populations, we identify a strain-sharing gradient across social relationships, reveal striking variation in clonal structure across bacteria and bacteriophage, and pinpoint genes whose nucleotide identity deviates from genome-wide expectations. ZipStrain is distributed as an open-source Python package and accompanying Nextflow pipeline at https://github.com/OlmLab/ZipStrain.

18
ESGI: Efficient splitting of generic indices in single-cell sequencing data

Stohn, T.; van de Brug, N. D.; Theodosiadou, A.; Thijssen, B.; Jastrzebski, K.; Wessels, L. F. A.; Bosdriesz, E.

2026-03-06 bioinformatics 10.64898/2026.03.04.709594 medRxiv
Top 0.1%
25.2%

Single-cell sequencing technologies increasingly rely on complex nucleotide barcoding schemes to encode cellular identities, experimental conditions, and multiple molecular modalities within a single experiment. While demultiplexing, alignment, and UMI-based quantification form the core preprocessing steps that transform raw sequencing reads into analyzable single-cell data, existing pipelines are often tightly coupled to specific experimental designs and typically assume fixed barcode positions and substitution-only error models. As a result, many emerging assays employing combinatorial, variable-length, or multimodal barcoding designs require custom, hard-coded preprocessing solutions that are difficult to generalize and maintain. Here, we present ESGI (Efficient Splitting of Generic Indices), a flexible and extendable framework for demultiplexing and processing single-cell sequencing data with arbitrary barcode architectures. ESGI operates directly on raw FASTQ files using a generic barcode pattern specification, supports barcode matching with insertions and deletions via Levenshtein distance, accommodates variable-length barcodes, and provides detailed quality metrics for barcode assignment. ESGI optionally integrates genome alignment via STAR and performs feature quantification and UMI collapsing to generate cell-by-feature count matrices. ESGI is well documented and readily applicable to novel single-cell experiments. We demonstrate the versatility of ESGI across six datasets spanning four distinct single-cell technologies, including combinatorial indexing-based transcriptomic and multimodal assays, feature barcode-based protein measurements, and spatial barcoding data. Across these applications, ESGI robustly demultiplexes complex barcode designs that are not natively supported by existing pipelines, while producing results comparable to established workflows where applicable.
Together, ESGI provides a general and future-proof solution for preprocessing single-cell sequencing data, enabling rapid adoption and analysis of emerging experimental designs.
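Barcode matching by Levenshtein distance, which the abstract above contrasts with substitution-only error models, can be illustrated with a short sketch. This is not ESGI's code or interface; `levenshtein`, `assign_barcode`, the whitelist argument, and the `max_dist` default are hypothetical names and values chosen for the example.

```python
def levenshtein(a, b):
    """Edit distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character of a
                curr[j - 1] + 1,           # insert a character of b
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def assign_barcode(observed, whitelist, max_dist=1):
    """Return the unique whitelist barcode within max_dist, else None."""
    best, best_dist, tie = None, max_dist + 1, False
    for bc in whitelist:
        d = levenshtein(observed, bc)
        if d < best_dist:
            best, best_dist, tie = bc, d, False
        elif d == best_dist and d <= max_dist:
            tie = True  # two barcodes equally close: ambiguous read
    return None if tie or best is None else best
```

Because the distance allows insertions and deletions, an observed sequence that is one base shorter than its true barcode can still be assigned, which a Hamming-distance comparison of fixed-length strings cannot do. Ambiguous reads are rejected here rather than assigned arbitrarily.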

19
NanoCortex: A Unified Agentic System for Nanopore Sequencing Analysis

Xia, Q.; Wang, Z.; Shokoufandeh, M.; Rouhanifard, S. H.; Wanunu, M.

2026-05-21 bioinformatics 10.64898/2026.05.19.726254 medRxiv
Top 0.1%
23.2%

Nanopore sequencing has enabled various layers of information about DNA and RNA sequence isoforms and chemical modifications. Yet, the archipelago of disjoint nanopore analysis tools makes navigating among these a significant challenge for the nanopore user. We present NanoCortex, a unified autonomous agentic framework designed to bridge this shortcoming by providing end-to-end data processing which ranges from raw signal basecalling to biological interpretation. Built upon Gemini API services that incur usage-based API costs and orchestrated through the Gemini Agent Development Kit (ADK), the system utilizes a multi-agent architecture to autonomously perform task parsing, code generation, iterative code-level self-correction, and scientific interpretation. Following code generation, the code can be used offline. Benchmarking reveals that NanoCortex achieves significantly higher usability across complex analytical tasks compared to general-purpose large language models. The framework seamlessly integrates experimental data with meta-analysis of publicly available biological databases to facilitate the extraction of biologically meaningful insights from sequencing data without cumbersome computational steps.

20
Fixative eXchange (FX)-seq: Scalable Single-nucleus RNA Sequencing Analysis of PFA-fixed or FFPE Tissue

Park, H.-E.; Lee, Y. T.; Lee, J.; Ji, H.; Song, Y.-L.; Lee, J. W.; Kim, S.-Y.; Hur, J. K.; Kim, E.; Lee, C. W.; Han, Y. D.; Kim, H.; Sohn, C. H.

2026-03-07 genomics 10.64898/2026.03.05.709668 medRxiv
Top 0.1%
23.1%

Single-nucleus RNA sequencing (snRNA-seq) of clinical formalin-fixed, paraffin-embedded (FFPE) samples has long been a challenge due to low reverse transcription (RT) yields. Here, we present Fixative-eXchange (FX)-seq, a highly scalable snRNA-seq method for heavily paraformaldehyde (PFA)-fixed and/or FFPE samples. We employ an organocatalyst to facilitate the removal of PFA crosslinks to increase RT yield and additional regiospecific Pt(II)-based crosslinking of RNA molecules to prevent leakage. FX-seq reveals cellular heterogeneity across multiple fixed samples by analyzing 321,710 nuclei, including PFA-fixed tissue, FFPE blocks, thin FFPE and hematoxylin and eosin (H&E)-stained sections from mouse brain and human cancer specimens such as gastrointestinal stromal tumor and colorectal cancer. FX-seq enables integrated analysis with pathologist annotation to label tumor and non-tumor regions of H&E-stained sections. FX-seq can also be applied to PFA-perfusion-based animal studies, large human cohort studies, and personalized drug treatment through precision medicine.